Auto Repair Shop Demand Planning

MGT 6203: Data Analytics in Business
July 24, 2022

Introduction

Team Number: 45

Members:
Lisa Chille
Nicholas Cunningham
Jeffrey Hedberg
Brittany Lange
Delband Taha

Background information

A hypothetical auto repair shop may struggle with demand forecasting since no one plans on getting into an accident; none of its customers intend to use its services until they suddenly find themselves needing a car repaired. By statistically analyzing and drawing inferences from collision data, the shop can build an understanding of its customers’ patterns and make the appropriate hiring/firing and physical expansion/subletting decisions as necessary. With a little more effort, it can even learn which demographic(s) to raise awareness with, so that in the event of a crash this repair shop is top of mind for the affected parties.

This project aims to support the annual planning process of a hypothetical repair shop in New York City through careful data analysis and inference, leading to recommendations for the shop’s owner(s).

Motivation

The objective of this project is to use motor vehicle collision reports compiled by the New York City Police Department (NYPD) to help a fictitious auto repair center in New York City estimate the volume of incoming vehicles it can expect to repair in the coming year, assuming its market share is known. Our analysis will help this repair center predict the staffing levels it will need to maintain and identify potential opportunities for expansion.

Additionally, we analyze the demographics of those involved in collisions and identify the groups that make a significant contribution to car collisions. We propose and measure the impact of a marketing campaign for this repair center. We also analyze the impact of COVID-19 on this industry sector.

Assumptions

We assume equal market share among the 710 motor vehicle body shops found in New York City. This gives an average market share of 1/710 ≈ 0.1408% per shop, which we use to estimate the volume of incoming vehicles the shop can expect.
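As a quick sanity check, the per-shop share implied by this equal-share assumption can be computed directly (a minimal sketch; the 710-shop count comes from the assumption above):

```r
# Equal-market-share assumption: 710 body shops in New York City
n_shops <- 710
market_share <- 1 / n_shops
round(100 * market_share, 4)  # 0.1408 (percent)
```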

Hypotheses

After our initial data exploration, we developed a few hypotheses:

Primary hypotheses

  1. We can estimate the monthly demand for a fictitious auto shop in New York City with reasonable accuracy by extrapolating the monthly car crash volume.
  2. We can identify the demographic that comprises the largest market segment for motor vehicle repairs after an accident; from our preliminary investigation, we suspect that young men comprise this demographic.

Secondary hypotheses

  1. We can assess the impact of advertising expenditure on the auto shop’s bottom line.
  2. We can quantify the impact of COVID-19 on demand in this sector.

Literature review

Demographics

In their research, Regev et al. (Regev, Rolison, & Moutari) evaluated the crash risk of drivers of different ages and genders, adjusting for travel exposure, and found that drivers between the ages of 21 and 29 carry the highest risk of being in a car crash. This contrasts with prior research suggesting that teenage and elderly drivers carry the highest risk, since that research did not control for the effect of travel exposure (Regev, Rolison, & Moutari, 131). In the team’s initial discussions, we hypothesised that the data would show that teenagers and the elderly have the highest rate of car collisions. This article broadened our perspective and set us up to let the data guide us.

Location

He et al. (He et al.) developed a technique to predict where accidents happen. This work would allow policymakers to install traffic calming devices in areas predetermined to be risky. While we did not build directly on their work, we were inspired by the advanced models they built.

COVID-19

Li and Zhao (Li & Zhao) determined that while the overall volume of car collisions has plummeted since the start of the pandemic, cyclists’ fatalities have tripled when the quantity of traffic is controlled for. This presents an area of further research for us: to dig deeper into the true impact of the pandemic, we would have to collect or find additional data to merge with our existing data sets. Because our time in this course is limited, we considered this out of scope for this project.

Approach

To perform this analysis, we take several steps:

Data wrangling

In this step, we read the data into memory and clean it. We use 3 data sets:

  1. Data detailing the collisions, Crashes,
  2. Data detailing the vehicles involved in collisions, Vehicles, and
  3. Data detailing the people involved in the collisions, Person.

After the data is read, we display a few summaries to give the reader a sense of the scope of the data we are dealing with. We then merge the 3 data sets so that we can perform the analysis smoothly.

Data exploration

After the initial wrangling is done, we explore the data by making a few plots. One effect we examine is that of COVID-19 on collisions in New York City.

Feature selection

We use several strategies to select the features with the most predictive power:

  1. Manual predictor selection
  2. LASSO
  3. Elastic net

Modeling

We split the data into 3 sets: training, validation, and test. We then take advantage of 5 different modeling techniques:

  1. Linear regression,
  2. Random forest,
  3. Classification and regression tree (CART),
  4. Gradient boosted trees, and
  5. An ensemble of the 4 models above.

For each, we visualize the output and then analyze its performance on the validation data set. We use \(R^2\) to analyze model performance.
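Explicitly, the \(R^2\) we report is the coefficient of determination computed from the sums of squares, matching the SST/SSE calculations in the modeling code:

\[
R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2}
\]

Note that when \(R^2\) is computed on a held-out data set, it can be negative: if the model predicts worse than simply using that set’s mean, then \(SSE > SST\).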

Model comparison

The objective here is to summarise the models’ performance metrics so that the best-performing model is apparent at a glance. Note that we compare the models’ performance on the validation data set.
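As a sketch of what this comparison looks like, assembled by hand from the validation \(R^2\) values reported in the modeling sections of this report (the boosted tree and ensemble rows would be appended analogously once those models are fit):

```r
# Validation R-squared values taken from the modeling sections of this report
model_comparison <- data.frame(
  Model    = c("Linear regression", "Random forest", "CART"),
  R_sq_val = c(-1.15, -0.65, -1.75)
)
# Sort so the best-performing model (highest validation R-squared) comes first
model_comparison[order(-model_comparison$R_sq_val), ]
```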

Prediction

Finally, we predict the collision volume on the test data set to estimate the chosen model’s performance. This prediction also serves as the basis for our recommendations for the auto shop.

Effect of a marketing campaign

We then perform an analysis to understand the impact of a marketing campaign on the auto repair shop’s business.

Effect of COVID-19

We also evaluate the effect of COVID-19 on the amount of business that the shop can get.

Conclusion

We then draw conclusions based on the results obtained above.

Useful libraries

We import necessary libraries. These libraries will be used as follows:

Library Use
dplyr Data manipulation
knitr RMarkdown knitting
kableExtra Kable tables
plotly Plotting
lubridate Date manipulation
randomForest Building Random Forest models
rpart Building CART models
rpart.plot Plotting regression trees built
xgboost Building Boosted Tree models
library(dplyr)
library(knitr)
library(kableExtra)
library(plotly)
library(lubridate)
library(randomForest)
library(rpart)
library(rpart.plot)
library(xgboost)

Data

Summaries

Below, we import data from 3 sources and view the raw summaries. The data sets are related to each other as shown in the diagram below.

(Figure: Data relationships)

These functions will be reused in our endeavour to read and summarise the data at hand.

read_data <- function(path) {
  data <- read.csv(path, stringsAsFactors = FALSE)
  colnames(data)[colnames(data) == 'CRASH.DATE'] <- 'CRASH_DATE'
  data$CRASH_DATE <- as.Date(data$CRASH_DATE, "%m/%d/%Y")
  data
}
tabulate_collisions <- function(data) {
  kable(t(summary(data))) %>% kable_classic(full_width = TRUE, html_font = "Cambria",
                                               font_size = 14)
}

Crashes

crashes_df <- read_data('../Data/Crashes.csv') #1,896,229 x 29
tabulate_collisions(crashes_df)
CRASH_DATE Min. :2012-07-01 1st Qu.:2014-10-28 Median :2016-12-15 Mean :2017-01-01 3rd Qu.:2019-01-03 Max. :2022-05-27 NA
CRASH.TIME Length:1895581 Class :character Mode :character NA NA NA NA
BOROUGH Length:1895581 Class :character Mode :character NA NA NA NA
ZIP.CODE Min. :10000 1st Qu.:10306 Median :11207 Mean :10837 3rd Qu.:11237 Max. :11697 NA’s :587485
LATITUDE Min. : 0.00 1st Qu.:40.67 Median :40.72 Mean :40.64 3rd Qu.:40.77 Max. :43.34 NA’s :219988
LONGITUDE Min. :-201.36 1st Qu.: -73.98 Median : -73.93 Mean : -73.77 3rd Qu.: -73.87 Max. : 0.00 NA’s :219988
LOCATION Length:1895581 Class :character Mode :character NA NA NA NA
ON.STREET.NAME Length:1895581 Class :character Mode :character NA NA NA NA
CROSS.STREET.NAME Length:1895581 Class :character Mode :character NA NA NA NA
OFF.STREET.NAME Length:1895581 Class :character Mode :character NA NA NA NA
NUMBER.OF.PERSONS.INJURED Min. : 0.0000 1st Qu.: 0.0000 Median : 0.0000 Mean : 0.2872 3rd Qu.: 0.0000 Max. :43.0000 NA’s :18
NUMBER.OF.PERSONS.KILLED Min. :0.000000 1st Qu.:0.000000 Median :0.000000 Mean :0.001357 3rd Qu.:0.000000 Max. :8.000000 NA’s :31
NUMBER.OF.PEDESTRIANS.INJURED Min. : 0.00000 1st Qu.: 0.00000 Median : 0.00000 Mean : 0.05304 3rd Qu.: 0.00000 Max. :27.00000 NA
NUMBER.OF.PEDESTRIANS.KILLED Min. :0.000000 1st Qu.:0.000000 Median :0.000000 Mean :0.000697 3rd Qu.:0.000000 Max. :6.000000 NA
NUMBER.OF.CYCLIST.INJURED Min. :0.00000 1st Qu.:0.00000 Median :0.00000 Mean :0.02433 3rd Qu.:0.00000 Max. :4.00000 NA
NUMBER.OF.CYCLIST.KILLED Min. :0.0000000 1st Qu.:0.0000000 Median :0.0000000 Mean :0.0001002 3rd Qu.:0.0000000 Max. :2.0000000 NA
NUMBER.OF.MOTORIST.INJURED Min. : 0.0000 1st Qu.: 0.0000 Median : 0.0000 Mean : 0.2082 3rd Qu.: 0.0000 Max. :43.0000 NA
NUMBER.OF.MOTORIST.KILLED Min. :0.00000 1st Qu.:0.00000 Median :0.00000 Mean :0.00055 3rd Qu.:0.00000 Max. :5.00000 NA
CONTRIBUTING.FACTOR.VEHICLE.1 Length:1895581 Class :character Mode :character NA NA NA NA
CONTRIBUTING.FACTOR.VEHICLE.2 Length:1895581 Class :character Mode :character NA NA NA NA
CONTRIBUTING.FACTOR.VEHICLE.3 Length:1895581 Class :character Mode :character NA NA NA NA
CONTRIBUTING.FACTOR.VEHICLE.4 Length:1895581 Class :character Mode :character NA NA NA NA
CONTRIBUTING.FACTOR.VEHICLE.5 Length:1895581 Class :character Mode :character NA NA NA NA
COLLISION_ID Min. : 22 1st Qu.:3046533 Median :3583981 Mean :3020876 3rd Qu.:4058140 Max. :4532390 NA
VEHICLE.TYPE.CODE.1 Length:1895581 Class :character Mode :character NA NA NA NA
VEHICLE.TYPE.CODE.2 Length:1895581 Class :character Mode :character NA NA NA NA
VEHICLE.TYPE.CODE.3 Length:1895581 Class :character Mode :character NA NA NA NA
VEHICLE.TYPE.CODE.4 Length:1895581 Class :character Mode :character NA NA NA NA
VEHICLE.TYPE.CODE.5 Length:1895581 Class :character Mode :character NA NA NA NA

Person

person_df <- read_data('../Data/Person.csv') #4,692,054 x 21
tabulate_collisions(person_df)
UNIQUE_ID Min. : 10922 1st Qu.: 6811622 Median : 8995010 Mean : 8530078 3rd Qu.:10214090 Max. :12236543 NA
COLLISION_ID Min. : 37 1st Qu.:3638730 Median :3921535 Mean :3852979 3rd Qu.:4210243 Max. :4532390 NA
CRASH_DATE Min. :2012-07-01 1st Qu.:2017-03-19 Median :2018-06-07 Mean :2018-07-07 3rd Qu.:2019-09-20 Max. :2022-05-27 NA
CRASH_TIME Length:4689796 Class :character Mode :character NA NA NA NA
PERSON_ID Length:4689796 Class :character Mode :character NA NA NA NA
PERSON_TYPE Length:4689796 Class :character Mode :character NA NA NA NA
PERSON_INJURY Length:4689796 Class :character Mode :character NA NA NA NA
VEHICLE_ID Min. : 123423 1st Qu.:17465986 Median :18528303 Mean :18252666 3rd Qu.:19124376 Max. :20228127 NA’s :185601
PERSON_AGE Min. :-999.0 1st Qu.: 23.0 Median : 35.0 Mean : 36.8 3rd Qu.: 50.0 Max. :9999.0 NA’s :452863
EJECTION Length:4689796 Class :character Mode :character NA NA NA NA
EMOTIONAL_STATUS Length:4689796 Class :character Mode :character NA NA NA NA
BODILY_INJURY Length:4689796 Class :character Mode :character NA NA NA NA
POSITION_IN_VEHICLE Length:4689796 Class :character Mode :character NA NA NA NA
SAFETY_EQUIPMENT Length:4689796 Class :character Mode :character NA NA NA NA
PED_LOCATION Length:4689796 Class :character Mode :character NA NA NA NA
PED_ACTION Length:4689796 Class :character Mode :character NA NA NA NA
COMPLAINT Length:4689796 Class :character Mode :character NA NA NA NA
PED_ROLE Length:4689796 Class :character Mode :character NA NA NA NA
CONTRIBUTING_FACTOR_1 Length:4689796 Class :character Mode :character NA NA NA NA
CONTRIBUTING_FACTOR_2 Length:4689796 Class :character Mode :character NA NA NA NA
PERSON_SEX Length:4689796 Class :character Mode :character NA NA NA NA

Vehicles

vehicles_df <- read_data('../Data/Vehicles.csv') #3,704,406 x 25
tabulate_collisions(vehicles_df)
UNIQUE_ID Min. : 111711 1st Qu.:14215160 Median :17306058 Mean :16060871 3rd Qu.:18739205 Max. :20121717 NA
COLLISION_ID Min. : 22 1st Qu.:3017853 Median :3567068 Mean :2996659 3rd Qu.:4028145 Max. :4484197 NA
CRASH_DATE Min. :2012-07-01 1st Qu.:2014-10-15 Median :2016-11-18 Mean :2016-11-21 3rd Qu.:2018-11-15 Max. :2021-12-04 NA
CRASH_TIME Length:3704406 Class :character Mode :character NA NA NA NA
VEHICLE_ID Length:3704406 Class :character Mode :character NA NA NA NA
STATE_REGISTRATION Length:3704406 Class :character Mode :character NA NA NA NA
VEHICLE_TYPE Length:3704406 Class :character Mode :character NA NA NA NA
VEHICLE_MAKE Length:3704406 Class :character Mode :character NA NA NA NA
VEHICLE_MODEL Length:3704406 Class :character Mode :character NA NA NA NA
VEHICLE_YEAR Min. : 1000 1st Qu.: 2008 Median : 2013 Mean : 2015 3rd Qu.: 2016 Max. :20063 NA’s :1796971
TRAVEL_DIRECTION Length:3704406 Class :character Mode :character NA NA NA NA
VEHICLE_OCCUPANTS Min. :0.00e+00 1st Qu.:1.00e+00 Median :1.00e+00 Mean :1.01e+03 3rd Qu.:1.00e+00 Max. :1.00e+09 NA’s :1718406
DRIVER_SEX Length:3704406 Class :character Mode :character NA NA NA NA
DRIVER_LICENSE_STATUS Length:3704406 Class :character Mode :character NA NA NA NA
DRIVER_LICENSE_JURISDICTION Length:3704406 Class :character Mode :character NA NA NA NA
PRE_CRASH Length:3704406 Class :character Mode :character NA NA NA NA
POINT_OF_IMPACT Length:3704406 Class :character Mode :character NA NA NA NA
VEHICLE_DAMAGE Length:3704406 Class :character Mode :character NA NA NA NA
VEHICLE_DAMAGE_1 Length:3704406 Class :character Mode :character NA NA NA NA
VEHICLE_DAMAGE_2 Length:3704406 Class :character Mode :character NA NA NA NA
VEHICLE_DAMAGE_3 Length:3704406 Class :character Mode :character NA NA NA NA
PUBLIC_PROPERTY_DAMAGE Length:3704406 Class :character Mode :character NA NA NA NA
PUBLIC_PROPERTY_DAMAGE_TYPE Length:3704406 Class :character Mode :character NA NA NA NA
CONTRIBUTING_FACTOR_1 Length:3704406 Class :character Mode :character NA NA NA NA
CONTRIBUTING_FACTOR_2 Length:3704406 Class :character Mode :character NA NA NA NA

Note the large sizes of the data sets we are dealing with.

Wrangling

As a result of a traffic safety initiative to eliminate traffic fatalities, the NYPD replaced its record management system with an electronic one (FORMS) in 2016 (“Motor Vehicle Collisions - Crashes”). We see this change reflected in the chart below.

monthly_agg1 <- (person_df %>% 
                   mutate(yr_mo = paste0(substr(CRASH_DATE,1,4),'-',substr(CRASH_DATE,6,7))) %>% 
                   group_by(yr_mo) %>% 
                   summarise(n = n()) %>% 
                   arrange(yr_mo)
                 )
plot_ly(x=monthly_agg1$yr_mo, y=monthly_agg1$n, type='bar') %>%
  layout(title = "Crash Data by Month from 2012-2022", xaxis = list(title = 'Month'), yaxis = list(title = 'Number of Crashes'))

Note that the amount of data collected from March 2016 onward greatly surpasses the amount previously collected. To control for the change in data recording systems, we will use data collected on or after January 1, 2017 for full-year modeling.

We are now ready to filter the data to select columns of interest. We will also combine data from our three sources into one place, using their common factors. A summary can be found below.

combined_df <- (crashes_df %>% 
                  select(-c(CRASH_DATE, CRASH.TIME, CONTRIBUTING.FACTOR.VEHICLE.1, CONTRIBUTING.FACTOR.VEHICLE.2,
                            CONTRIBUTING.FACTOR.VEHICLE.3, CONTRIBUTING.FACTOR.VEHICLE.4, CONTRIBUTING.FACTOR.VEHICLE.5,
                            VEHICLE.TYPE.CODE.4, VEHICLE.TYPE.CODE.5,
                            NUMBER.OF.PEDESTRIANS.INJURED, NUMBER.OF.PEDESTRIANS.KILLED,
                            NUMBER.OF.CYCLIST.INJURED, NUMBER.OF.CYCLIST.KILLED,
                            ON.STREET.NAME, CROSS.STREET.NAME, OFF.STREET.NAME)) %>% 
                  inner_join(vehicles_df %>% 
                               filter(!is.na(VEHICLE_ID)) %>% 
                               select(-c(CRASH_DATE, CRASH_TIME, VEHICLE_ID))
                             , by = 'COLLISION_ID') %>% 
                  inner_join(person_df %>% 
                               select(-c(UNIQUE_ID, CONTRIBUTING_FACTOR_1, CONTRIBUTING_FACTOR_2))
                             , by= c('COLLISION_ID'='COLLISION_ID', 'UNIQUE_ID'='VEHICLE_ID')) %>% 
                  filter((PED_ROLE %in% c('Driver'))) %>%  #drivers only
                  filter((CRASH_DATE >= '2017-01-01') & (CRASH_DATE < '2021-12-01')) %>% #only from 2017-01-01 to 2021-11-30
                  filter(PERSON_AGE > 14 & PERSON_AGE < 101) %>% 
                  filter(PERSON_SEX == 'F' | PERSON_SEX=='M') %>% 
                  mutate(MONTH = substr(CRASH_DATE,6,7)) %>% 
                  mutate(TIMESTEP = as.period(interval(as.Date('2017-01-01'), CRASH_DATE)) %/% months(1)) %>% 
                  mutate(yr_mo = paste0(substr(CRASH_DATE,1,4),'-',substr(CRASH_DATE,6,7))) %>%
                  select(COLLISION_ID, CRASH_DATE, PERSON_AGE, PERSON_SEX, MONTH, TIMESTEP, yr_mo) %>% 
                  arrange(CRASH_DATE)
                )
kable(t(summary(combined_df))) %>% kable_classic(full_width = TRUE, html_font = "Cambria", font_size = 14)
COLLISION_ID Min. :3589945 1st Qu.:3805149 Median :4017960 Mean :4023043 3rd Qu.:4234930 Max. :4484186
CRASH_DATE Min. :2017-01-01 1st Qu.:2017-12-04 Median :2018-11-05 Mean :2018-12-28 3rd Qu.:2019-11-01 Max. :2021-11-30
PERSON_AGE Min. : 15.00 1st Qu.: 29.00 Median : 40.00 Mean : 41.69 3rd Qu.: 53.00 Max. :100.00
PERSON_SEX Length:1339106 Class :character Mode :character NA NA NA
MONTH Length:1339106 Class :character Mode :character NA NA NA
TIMESTEP Min. : 0.00 1st Qu.:11.00 Median :22.00 Mean :23.44 3rd Qu.:34.00 Max. :58.00
yr_mo Length:1339106 Class :character Mode :character NA NA NA

Now, our data set is ready for exploration.

Data Exploration

We create our aggregate of volume data available for modeling and visualize it with a plot. It is interesting to note the impact of COVID-19 on crash volume beginning in March 2020.

monthly_agg2 <- (combined_df %>% 
                   group_by(TIMESTEP, yr_mo, MONTH) %>% 
                   summarise(n = n(), .groups = 'drop') %>% 
                   arrange(TIMESTEP, yr_mo, MONTH)
                 )
plot_ly(x=monthly_agg2$yr_mo, y=monthly_agg2$n, type='bar') %>%
  layout(title = "Crash Data by Month from 2017-2021", xaxis = list(title = 'Month'), yaxis = list(title = 'Number of Crashes'))

Modeling

Splitting into training, validation, and testing data sets

We also create a coarse aggregate for modeling with feature groups and separate our data into training, validation, and test data sets. Since we are dealing with a temporal model, we select 2017-01 to 2019-12 for training, 2020-01 to 2020-12 for validation, and 2021-01 to 2021-11 for testing; a random split would be nonsensical for our purposes. After the hyperparameters for each model have been tuned using the training and validation data sets, the models are re-trained on the combined training + validation data. This maximizes the data available for learning while ensuring the validation steps are completed properly.

monthly_agg3 <- (combined_df %>% 
                   group_by(TIMESTEP, yr_mo, MONTH, PERSON_SEX, PERSON_AGE) %>% 
                   summarise(n = n(), .groups = 'drop') %>% 
                   arrange(TIMESTEP, yr_mo, MONTH, PERSON_SEX, PERSON_AGE)
)

#Create Training and Test data sets
train_df <- monthly_agg3 %>% filter(TIMESTEP <= 35) # 2017-01 to 2019-12
val_df <- monthly_agg3 %>% filter((TIMESTEP > 35) & (TIMESTEP <= 47)) # 2020-01 to 2020-12
test_df <- monthly_agg3 %>% filter(TIMESTEP > 47) # 2021-01 to 2021-11

#Create Train_Val data sets (final model after Train/Val hyper parameter tuning)
train_val_df <- monthly_agg3 %>% filter(TIMESTEP <= 47) # 2017-01 to 2020-12

Linear Regression

Training

#Train Linear Model
lm_model <- lm(data = train_df, formula = n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX )
summary(lm_model)
## 
## Call:
## lm(formula = n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX, 
##     data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -463.96  -73.77   15.27   66.20  269.38 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 335.67839    6.71275  50.006  < 2e-16 ***
## TIMESTEP     -0.71945    0.15283  -4.708 2.57e-06 ***
## MONTH02     -13.56568    7.36429  -1.842 0.065513 .  
## MONTH03      13.23735    7.36933   1.796 0.072503 .  
## MONTH04       5.12239    7.37183   0.695 0.487170    
## MONTH05      35.55638    7.38663   4.814 1.52e-06 ***
## MONTH06      35.40924    7.39397   4.789 1.72e-06 ***
## MONTH07      20.93237    7.40274   2.828 0.004705 ** 
## MONTH08      15.48523    7.43398   2.083 0.037292 *  
## MONTH09      19.43578    7.45095   2.608 0.009117 ** 
## MONTH10      28.85893    7.47622   3.860 0.000115 ***
## MONTH11      21.90978    7.50981   2.917 0.003542 ** 
## MONTH12      20.60703    7.55127   2.729 0.006373 ** 
## PERSON_AGE   -4.45608    0.06404 -69.578  < 2e-16 ***
## PERSON_SEXM 171.44422    2.99888  57.169  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 114.2 on 5791 degrees of freedom
## Multiple R-squared:  0.5799, Adjusted R-squared:  0.5788 
## F-statistic: 570.9 on 14 and 5791 DF,  p-value: < 2.2e-16
SST_lm_train <- sum((train_df$n - mean(train_df$n))^2)
SSE_lm_train <- sum((train_df$n - predict(lm_model, newdata = train_df))^2)
R_sq_lm_train <- 1-(SSE_lm_train/SST_lm_train)
R_sq_lm_train #0.5799
## [1] 0.5798524
#Linear Model Validation Metrics
SST_lm_val <- sum((val_df$n - mean(val_df$n))^2)
SSE_lm_val <- sum((val_df$n - predict(lm_model, newdata = val_df))^2)
R_sq_lm_val <- 1-(SSE_lm_val/SST_lm_val)
R_sq_lm_val #-1.154703
## [1] -1.154616
#Train_Val Linear Model
lm_model_2 <- lm(data = train_val_df, formula = n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX )
summary(lm_model_2)
## 
## Call:
## lm(formula = n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX, 
##     data = train_val_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -441.83  -64.83    4.23   58.78  299.83 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 352.82598    5.76802  61.169  < 2e-16 ***
## TIMESTEP     -2.92530    0.09545 -30.646  < 2e-16 ***
## MONTH02     -10.16351    6.23069  -1.631 0.102889    
## MONTH03       5.85024    6.23280   0.939 0.347955    
## MONTH04     -17.30059    6.27380  -2.758 0.005837 ** 
## MONTH05      13.68088    6.25906   2.186 0.028863 *  
## MONTH06      19.15099    6.25343   3.062 0.002203 ** 
## MONTH07      13.20462    6.25123   2.112 0.034691 *  
## MONTH08      12.80755    6.26343   2.045 0.040908 *  
## MONTH09      18.12040    6.27354   2.888 0.003883 ** 
## MONTH10      27.72229    6.28431   4.411 1.04e-05 ***
## MONTH11      22.61062    6.30315   3.587 0.000336 ***
## MONTH12      23.03260    6.32483   3.642 0.000273 ***
## PERSON_AGE   -3.92660    0.05504 -71.344  < 2e-16 ***
## PERSON_SEXM 150.90222    2.55012  59.175  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 111.3 on 7619 degrees of freedom
## Multiple R-squared:  0.5457, Adjusted R-squared:  0.5449 
## F-statistic: 653.8 on 14 and 7619 DF,  p-value: < 2.2e-16
SST_lm_train_val <- sum((train_val_df$n - mean(train_val_df$n))^2)
SSE_lm_train_val <- sum((train_val_df$n - predict(lm_model_2, newdata = train_val_df))^2)
R_sq_lm_train_val <- 1-(SSE_lm_train_val/SST_lm_train_val)
R_sq_lm_train_val #0.545718
## [1] 0.5457191

Visualisation

plot_ly(x=train_val_df$yr_mo, y=(predict(lm_model_2, newdata = train_val_df)-train_val_df$n), type='scatter', mode='markers') %>% 
  layout(title = 'Plot of Linear Model Residuals vs yr_mo for Train_Val Data',
         xaxis = list(title = 'yr_mo'),
         yaxis = list(title = 'LM Residuals', rangemode = "tozero"))

Test Metrics

# Verify performance on Test dataset
SST_lm_test <- sum((test_df$n - mean(test_df$n))^2)
SSE_lm_test <- sum((test_df$n - predict(lm_model_2, newdata = test_df))^2)
R_sq_lm_test <- 1-(SSE_lm_test/SST_lm_test)
R_sq_lm_test #0.08364779
## [1] 0.08364779

Our linear model trained on the training + validation data had an R-squared value of 0.55, while its performance on the test data dropped to an R-squared of 0.08.

Random forest

Training

# #Train & Val Random Forest Model - Loop - Run only once to get optimal settings
# rf_results <- data.frame()
# for(i in seq.int(1, 200)){
#   set.seed(12345)
#   rf_model <- randomForest(n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX, data=train_df, importance=TRUE, ntree=i)  #
#   summary(rf_model)
#   var_importance_df <- data.frame(importance(rf_model)) %>% rename('PCT_IncMSE'='X.IncMSE') %>% arrange(desc(PCT_IncMSE))
#   var_importance_df
#   SST_rf_train <- sum((train_df$n - mean(train_df$n))^2)
#   SSE_rf_train <- sum((train_df$n - predict(rf_model, newdata = train_df))^2)
#   R_sq_rf_train <- 1-(SSE_rf_train/SST_rf_train)
#   R_sq_rf_train #0.8670912
# 
#   #Random Forest Model Validation Metrics
#   SST_rf_val <- sum((val_df$n - mean(val_df$n))^2)
#   SSE_rf_val <- sum((val_df$n - predict(rf_model, newdata = val_df))^2)
#   R_sq_rf_val <- 1-(SSE_rf_val/SST_rf_val)
#   R_sq_rf_val #-0.5731295
# 
#   rf_results <- rf_results %>% bind_rows(data.frame(i=i, R_sq_rf_train=R_sq_rf_train, R_sq_rf_val=R_sq_rf_val))
# }
# 
# rf_results %>% filter(rf_results$R_sq_rf_val == max(rf_results$R_sq_rf_val))
# # i R_sq_rf_train R_sq_rf_val
# # 1 41     0.8668467    -0.49019

# Best Model from Train & Val
set.seed(12345)
rf_model <- randomForest(n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX, data=train_df, importance=TRUE, ntree=41)  #41
SST_rf_train <- sum((train_df$n - mean(train_df$n))^2)
SSE_rf_train <- sum((train_df$n - predict(rf_model, newdata = train_df))^2)
R_sq_rf_train <- 1-(SSE_rf_train/SST_rf_train)
R_sq_rf_train #0.8503601
## [1] 0.8503601
#Random Forest Model Validation Metrics
SST_rf_val <- sum((val_df$n - mean(val_df$n))^2)
SSE_rf_val <- sum((val_df$n - predict(rf_model, newdata = val_df))^2)
R_sq_rf_val <- 1-(SSE_rf_val/SST_rf_val)
R_sq_rf_val #-0.6534223
## [1] -0.6534223
# Train_Val RF model
set.seed(12345)
rf_model_2 <- randomForest(n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX, data=train_val_df, importance=TRUE, ntree=41)  #41
rf_var_importance_df <- data.frame(importance(rf_model_2)) %>% rename('PCT_IncMSE'='X.IncMSE') %>% arrange(desc(PCT_IncMSE))
SST_rf_train_val <- sum((train_val_df$n - mean(train_val_df$n))^2)
SSE_rf_train_val <- sum((train_val_df$n - predict(rf_model_2, newdata = train_val_df))^2)
R_sq_rf_train_val <- 1-(SSE_rf_train_val/SST_rf_train_val)
R_sq_rf_train_val #0.8467027
## [1] 0.8467027

Visualisation

# Make plot for Random Forest Feature Importance
plot_ly(x=rf_var_importance_df$PCT_IncMSE , y=reorder(rf_var_importance_df %>% row.names(), rf_var_importance_df$PCT_IncMSE ), type='bar', orientation = 'h') %>%
  layout(title = "Random Forest Feature Importance", xaxis = list(title = '% Increase MSE'), yaxis = list(title = 'Feature'))
# Make plot for Random Forest Residuals
plot_ly(x=train_val_df$yr_mo, y=(predict(rf_model_2, newdata = train_val_df)-train_val_df$n), type='scatter', mode='markers') %>%
  layout(title = 'Plot of Random Forest Model Residuals vs yr_mo for Train_Val Data',
         xaxis = list(title = 'yr_mo'),
         yaxis = list(title = 'RF Residuals', rangemode = "tozero"))

Test Metrics

SST_rf_test <- sum((test_df$n - mean(test_df$n))^2)
SSE_rf_test <- sum((test_df$n - predict(rf_model_2, newdata = test_df))^2)
R_sq_rf_test <- 1-(SSE_rf_test/SST_rf_test)
R_sq_rf_test #0.766301
## [1] 0.766301

Our random forest model trained on the training + validation data had an R-squared value of 0.85, while its performance on the test data dropped to an R-squared of 0.77.

Classification and Regression Tree (CART)

Training

#Train CART Model
set.seed(12345)
min_leaf <- 3
min_split <- 3*min_leaf
cart_model <- rpart(n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX, data=train_df, 
                    control = c(minsplit = min_split, minbucket = min_leaf, cp=0.01)) #default = 0.01
SST_CART_train <- sum((train_df$n - mean(train_df$n))^2)
SSE_CART_train <- sum((train_df$n - predict(cart_model, newdata = train_df))^2)
R_sq_cart_train <- 1-(SSE_CART_train/SST_CART_train)
R_sq_cart_train #0.9473921
## [1] 0.9473918
#CART Model Validation Metrics
SST_cart_val <- sum((val_df$n - mean(val_df$n))^2)
SSE_cart_val <- sum((val_df$n - predict(cart_model, newdata = val_df))^2)
R_sq_cart_val <- 1-(SSE_cart_val/SST_cart_val)
R_sq_cart_val #-1.753216
## [1] -1.75313
# Train_Val CART model
set.seed(12345)
min_leaf <- 3
min_split <- 3*min_leaf
cart_model_2 <- rpart(n ~ TIMESTEP + MONTH + PERSON_AGE + PERSON_SEX, data=train_val_df, 
                    control = c(minsplit = min_split, minbucket = min_leaf, cp=0.01)) #default = 0.01
SST_CART_train_val <- sum((train_val_df$n - mean(train_val_df$n))^2)
SSE_CART_train_val <- sum((train_val_df$n - predict(cart_model_2, newdata = train_val_df))^2)
R_sq_cart_train_val <- 1-(SSE_CART_train_val/SST_CART_train_val)
R_sq_cart_train_val #0.9023342
## [1] 0.9023338
cart_plot_data <- (data.frame(Variable_Importance = cart_model_2$variable.importance, 
                             Variable_Importance_Pct_Tot = round(100*cart_model_2$variable.importance/sum(cart_model_2$variable.importance),0)) %>% 
                     as.data.frame())

Visualisation

# Make plot for CART Feature Importance
plot_ly(x=cart_plot_data$Variable_Importance_Pct_Tot , y=reorder(cart_plot_data %>% row.names(), cart_plot_data$Variable_Importance_Pct_Tot ), type='bar', orientation = 'h') %>%
  layout(title = "CART Feature Importance", xaxis = list(title = 'Variable_Importance_Pct_Tot'), yaxis = list(title = 'Feature'))
# Make plot for CART diagram
rpart.plot(cart_model_2)

# Make plot for CART Residuals
plot_ly(x=train_val_df$yr_mo, y=(predict(cart_model_2, newdata = train_val_df)-train_val_df$n), type='scatter', mode='markers') %>% 
  layout(title = 'Plot of CART Model Residuals vs yr_mo for Train_Val Data',
         xaxis = list(title = 'yr_mo'),
         yaxis = list(title = 'CART Residuals', rangemode = "tozero"))

Test Metrics

SST_cart_test <- sum((test_df$n - mean(test_df$n))^2)
SSE_cart_test <- sum((test_df$n - predict(cart_model_2, newdata = test_df))^2)
R_sq_cart_test <- 1-(SSE_cart_test/SST_cart_test)
R_sq_cart_test #0.6526338
## [1] 0.65263

Our CART model trained on the training + validation data had an R-squared value of 0.90, while its performance on the test data dropped to an R-squared of 0.65.

Extreme Gradient Boosted Tree (XGBoost)

Training

# XGB dataset feature changes (factor levels not supported, so integer casting)
to_xgb <- function(df) {
  df %>% 
    select(TIMESTEP, MONTH, PERSON_AGE, PERSON_SEX) %>% 
    mutate(MONTH = as.integer(as.factor(MONTH)),
           PERSON_SEX = ifelse(PERSON_SEX == 'M', 0, 1))
}

train_xgb_df <- to_xgb(train_df)
val_xgb_df <- to_xgb(val_df)
train_val_xgb_df <- to_xgb(train_val_df)
test_xgb_df <- to_xgb(test_df)

# # #Train & Val Boosted Tree Model - Loop - Run only once to get optimal settings  
# #Train Boosted Tree Model
# bt_results <- data.frame()
# for(i in seq.int(1, 20)){
#   for(j in seq.int(3,6)){
#     set.seed(12345)
#     bt_model <- xgboost(data=as.matrix(train_xgb_df), label = train_df$n, objective='reg:squarederror', nthread=1,  nrounds=i, max.depth=j, eta=0.05)  #
#     summary(bt_model)
#     
#     SST_bt_train <- sum((train_df$n - mean(train_df$n))^2)
#     SSE_bt_train <- sum((train_df$n - predict(bt_model, newdata = as.matrix(train_xgb_df)))^2)
#     R_sq_bt_train <- 1-(SSE_bt_train/SST_bt_train)
#     R_sq_bt_train #0.4761988
#     
#     #Boosted Tree Model Validation Metrics
#     SST_bt_val <- sum((val_df$n - mean(val_df$n))^2)
#     SSE_bt_val <- sum((val_df$n - predict(bt_model, newdata = as.matrix(val_xgb_df)))^2)
#     R_sq_bt_val <- 1-(SSE_bt_val/SST_bt_val)
#     R_sq_bt_val #0.7243043
#     
#     bt_results <- bt_results %>% bind_rows(data.frame(n_trees=i, max_depth=j, R_sq_bt_train=R_sq_bt_train, R_sq_bt_val=R_sq_bt_val))
# 
#   }
# }
# 
# plot_ly(x=bt_results$n_trees, y=bt_results$R_sq_bt_train, type='scatter', mode='lines+markers', color=as.factor(paste0('Train-',bt_results$max_depth))) %>%
#   add_trace(x=bt_results$n_trees, y=bt_results$R_sq_bt_val, type='scatter', mode='lines+markers', color=as.factor(paste0('Val-',bt_results$max_depth))) %>% 
#   layout(title = 'Plot of Boosted Tree Train and Val R_Sq vs Number of Trees (depth in name)',
#          xaxis = list(title = 'Number of Trees'),
#          yaxis = list(title = 'R_Sq'))
# 
# 
# bt_results %>% filter(bt_results$R_sq_bt_val == max(bt_results$R_sq_bt_val))
# # n_trees max_depth R_sq_bt_train R_sq_bt_val
# # 12         5     0.3742168   0.7650132

# Best model from Train/Val tuning
set.seed(12345)
bt_model <- xgboost(data=as.matrix(train_xgb_df), label = train_df$n, objective='reg:squarederror', nthread=1,  nrounds=12, max.depth=5, eta=0.05, verbose=0)
SST_bt_train <- sum((train_df$n - mean(train_df$n))^2)
SSE_bt_train <- sum((train_df$n - predict(bt_model, newdata = as.matrix(train_xgb_df)))^2)
R_sq_bt_train <- 1-(SSE_bt_train/SST_bt_train)
R_sq_bt_train #0.3742168
## [1] 0.374216
#Boosted Tree Model Validation Metrics
SST_bt_val <- sum((val_df$n - mean(val_df$n))^2)
SSE_bt_val <- sum((val_df$n - predict(bt_model, newdata = as.matrix(val_xgb_df)))^2)
R_sq_bt_val <- 1-(SSE_bt_val/SST_bt_val)
R_sq_bt_val #0.7650132
## [1] 0.765024
# Train_Val Boosted Tree model
set.seed(12345)
bt_model_2 <- xgboost(data=as.matrix(train_val_xgb_df), label = train_val_df$n, objective='reg:squarederror', nthread=1,  nrounds=12, max.depth=5, eta=0.05, verbose=0)
xgb.plot.importance(xgb.importance(model=bt_model_2))

SST_bt_train_val <- sum((train_val_df$n - mean(train_val_df$n))^2)
SSE_bt_train_val <- sum((train_val_df$n - predict(bt_model_2, newdata = as.matrix(train_val_xgb_df)))^2)
R_sq_bt_train_val <- 1-(SSE_bt_train_val/SST_bt_train_val)
R_sq_bt_train_val #0.4057451
## [1] 0.4057425

Visualisation

plot_ly(x=train_val_df$yr_mo, y=(predict(bt_model_2, newdata = as.matrix(train_val_xgb_df))-train_val_df$n), type='scatter', mode='markers') %>%
  layout(title = 'Plot of Boosted Tree Model Residuals vs yr_mo for Train_Val Data',
         xaxis = list(title = 'yr_mo'),
         yaxis = list(title = 'BT Residuals', rangemode = "tozero"))

Test Metrics

SST_bt_test <- sum((test_df$n - mean(test_df$n))^2)
SSE_bt_test <- sum((test_df$n - predict(bt_model_2, newdata = as.matrix(test_xgb_df)))^2)
R_sq_bt_test <- 1-(SSE_bt_test/SST_bt_test)
R_sq_bt_test #0.2349657
## [1] 0.2349213

Our XGBoost model had an R-squared value of 0.41 on the training + validation data, while its performance on the test data dropped to an R-squared of 0.23.

Ensemble Model

Training

# Ensemble final models for better performance
val_pred_actual <- val_df$n
val_pred_lm <- predict(lm_model_2, newdata = val_df)
val_pred_rf <- predict(rf_model_2, newdata = val_df)
val_pred_cart <- predict(cart_model_2, newdata = val_df)
val_pred_bt <- predict(bt_model_2, newdata = as.matrix(val_xgb_df))

ensemble_model <- lm(val_pred_actual ~ val_pred_lm + val_pred_rf + val_pred_cart + val_pred_bt)
summary(ensemble_model)
## 
## Call:
## lm(formula = val_pred_actual ~ val_pred_lm + val_pred_rf + val_pred_cart + 
##     val_pred_bt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -105.626  -12.280    0.429   10.726  108.447 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -14.641719   1.434035 -10.210   <2e-16 ***
## val_pred_lm     0.007157   0.009049   0.791    0.429    
## val_pred_rf     0.315755   0.035711   8.842   <2e-16 ***
## val_pred_cart  -0.085805   0.015223  -5.637    2e-08 ***
## val_pred_bt     1.840335   0.062587  29.404   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.6 on 1823 degrees of freedom
## Multiple R-squared:  0.9131, Adjusted R-squared:  0.9129 
## F-statistic:  4787 on 4 and 1823 DF,  p-value: < 2.2e-16
#  =>  val_pred_lm is not significant (p = 0.429), so the ensemble was refit without it;
#      the final ensemble is -15.493854 + 1.746018*bt_model_2 + 0.378762*rf_model_2 - 0.088279*cart_model_2

# Ensemble model result calculations
val_pred_ensemble <- -15.493854 + 1.746018*val_pred_bt + 0.378762*val_pred_rf - 0.088279*val_pred_cart

SST_ensemble_val <- sum((val_df$n - mean(val_df$n))^2)
SSE_ensemble_val <- sum((val_df$n - val_pred_ensemble)^2)
R_sq_ensemble_val <- 1-(SSE_ensemble_val/SST_ensemble_val)
R_sq_ensemble_val #0.9128438
## [1] 0.9128438
train_val_pred_ensemble <- -15.493854 + 1.746018*predict(bt_model_2, newdata = as.matrix(train_val_xgb_df)) + 0.378762*predict(rf_model_2, newdata = train_val_df) - 0.088279*predict(cart_model_2, newdata = train_val_df)
SST_ensemble_train_val <- sum((train_val_df$n - mean(train_val_df$n))^2)
SSE_ensemble_train_val <- sum((train_val_df$n - train_val_pred_ensemble)^2)
R_sq_ensemble_train_val <- 1-(SSE_ensemble_train_val/SST_ensemble_train_val)
R_sq_ensemble_train_val #0.961873
## [1] 0.961873
ensemble_plot_data <- data.frame(n_ensemble = as.numeric(train_val_pred_ensemble), n=train_val_df$n, yr_mo=train_val_df$yr_mo)

Visualisation

plot_ly(x=ensemble_plot_data$yr_mo, y=(ensemble_plot_data$n_ensemble - ensemble_plot_data$n), type='scatter', mode='markers') %>%
  layout(title = 'Plot of Ensemble Model Residuals vs yr_mo for Train_Val Data',
         xaxis = list(title = 'yr_mo'),
         yaxis = list(title = 'Ensemble Residuals', rangemode = "tozero"))

Test Metrics

test_pred_ensemble <- -15.493854 + 1.746018*predict(bt_model_2, newdata = as.matrix(test_xgb_df))  + 0.378762*predict(rf_model_2, newdata = test_df) - 0.088279*predict(cart_model_2, newdata = test_df)
SST_ensemble_test <- sum((test_df$n - mean(test_df$n))^2)
SSE_ensemble_test <- sum((test_df$n - test_pred_ensemble)^2)
R_sq_ensemble_test <- 1-(SSE_ensemble_test/SST_ensemble_test)
R_sq_ensemble_test #0.87298
## [1] 0.87298

Our Ensemble model had an R-squared value of 0.96 on the training + validation data, while its performance on the test data dropped to an R-squared of 0.87.

Model comparison

Each model’s performance can be compared in the following interactive plot. The red line represents a perfect model. Notice how much better the ensemble model does than the other individual models.

Test Data

plot_ly(x=test_df$n, y=(predict(cart_model_2, newdata = test_df)), type='scatter', mode='markers', name='CART', marker=list(color='#1f77b4', opacity=0.5)) %>% 
  add_trace(x=test_df$n, y=(predict(rf_model_2, newdata = test_df)), type='scatter', mode='markers', name='RF', marker=list(color='#ff7f0e', opacity=0.5)) %>% 
  add_trace(x=test_df$n, y=(predict(lm_model_2, newdata = test_df)), type='scatter', mode='markers', name='LM', marker=list(color='#2ca02c', opacity=0.5)) %>%
  add_trace(x=c(0,350), y=c(0,350), type='scatter', mode='lines+markers', name='Perfect', alpha=1, line=list(color='#d62728'), marker=list(color='#2ca02c', opacity=0)) %>% 
  add_trace(x=test_df$n, y=(predict(bt_model_2, newdata = as.matrix(test_xgb_df))), type='scatter', mode='markers', name='BT', marker=list(color='#9467bd', opacity=0.5)) %>%
  add_trace(x=test_df$n, y=(test_pred_ensemble), type='scatter', mode='markers', name='Ensemble', marker=list(color='#8c564b', opacity=0.5)) %>%
  layout(title = 'Plot of All Model Predictions vs Actual Values - Test Data',
         xaxis = list(title = 'Actual Value', range=c(0,375)),
         yaxis = list(title = 'Model Prediction', range=c(-200,350)))

Overall Model Performance Summary Table

kable(data.frame(Model_Type = c('Linear', 'Random Forest', 'CART', 'Boosted Trees', 'Ensemble'),
                 R_sq_train = round(c(R_sq_lm_train, R_sq_rf_train, R_sq_cart_train, R_sq_bt_train, as.integer(NA)), 2),
                 R_sq_val = round(c(R_sq_lm_val, R_sq_rf_val, R_sq_cart_val, R_sq_bt_val, R_sq_ensemble_val), 2),
                 R_sq_train_val = round(c(R_sq_lm_train_val, R_sq_rf_train_val, R_sq_cart_train_val, R_sq_bt_train_val, R_sq_ensemble_train_val), 2),
                 R_sq_test = round(c(R_sq_lm_test, R_sq_rf_test, R_sq_cart_test, R_sq_bt_test, R_sq_ensemble_test), 2))) %>% 
  kable_classic(full_width = FALSE, html_font = "Cambria") %>%
  row_spec(c(5), background = c("yellow"))
Model_Type R_sq_train R_sq_val R_sq_train_val R_sq_test
Linear 0.58 -1.15 0.55 0.08
Random Forest 0.85 -0.65 0.85 0.77
CART 0.95 -1.75 0.90 0.65
Boosted Trees 0.37 0.77 0.41 0.23
Ensemble NA 0.91 0.96 0.87

We can see that our Ensemble model has the best performance on the Train + Validation data, so we select it and move forward with it. We will now estimate the monthly volume our auto body repair center can expect in the future, using our expected market share of 0.1408%.

Prediction

We continue here with the assumption that our repair shop operates under perfect competition, meaning it captures an equal share of demand, i.e., 0.1408%. With this assumption in hand, we estimate the shop’s future monthly car volume.

Computation

First, we formally compute the market share of the shop.

body_shop_count <- 710
Shop_Market_Share <- 0.1408/100
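The 0.1408% figure follows directly from the equal-share assumption: it is one part in the roughly 710 body shops counted above. A minimal sketch of that arithmetic, using the same numbers as the chunk above:

```r
# Under perfect competition, each of the ~710 NYC body shops
# captures an equal 1/710 share of collision-repair demand.
body_shop_count <- 710
equal_share_pct <- 100 / body_shop_count
round(equal_share_pct, 4)  # 0.1408, i.e. Shop_Market_Share expressed as a percentage
```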

Now, we use the test data that we had set aside to predict the monthly collision volume.

test_agg_df <- (test_df %>% 
                  group_by(MONTH) %>% 
                  summarise(Actual_NYC_Collisions=sum(n), 
                            Actual_Shop_Volume=round(Shop_Market_Share*sum(n), 0)))

Here, we create a dataset for the future year (2021).

temp1 <- data.frame(TIMESTEP = c(48:59))
temp2 <- data.frame(MONTH = c('01','02','03','04','05','06','07','08','09','10','11','12'))
temp3 <- data.frame(PERSON_AGE = c(15:100))
temp4 <- data.frame(PERSON_SEX = c('M','F'))
monthly_predictions_df <- temp1 %>%
  bind_cols(temp2) %>%
  full_join(temp3, by=character()) %>%
  full_join(temp4, by=character())

And finally, we make predictions.

monthly_predictions_xgb_df <- (monthly_predictions_df %>% 
                                 select(TIMESTEP,MONTH,PERSON_AGE,PERSON_SEX) %>% 
                                 mutate(MONTH=as.integer(as.factor(MONTH)),
                                        PERSON_SEX=ifelse(PERSON_SEX=='M',0,1)))
#Create predictions
monthly_predictions_df$rf <- predict(rf_model_2, newdata = monthly_predictions_df)
monthly_predictions_df$cart <- predict(cart_model_2, newdata = monthly_predictions_df)
monthly_predictions_df$bt <- predict(bt_model_2, newdata = as.matrix(monthly_predictions_xgb_df))
monthly_predictions_df$ensemble <- (-15.493854 
                                    + 1.746018*monthly_predictions_df$bt  
                                    + 0.378762*monthly_predictions_df$rf 
                                    - 0.088279*monthly_predictions_df$cart)

monthly_predictions_agg_df <- (monthly_predictions_df %>% 
                                 group_by(MONTH) %>% 
                                 summarise(Predicted_NYC_Collisions=round(sum(ensemble), 0)) %>%
                                 mutate(Predicted_Shop_Volume = round(Shop_Market_Share*Predicted_NYC_Collisions, 0)) %>% 
                                 left_join(test_agg_df, by='MONTH') %>% 
                                 mutate(YEAR = 2021) %>% 
                                 select(YEAR, MONTH, Actual_NYC_Collisions, Actual_Shop_Volume, Predicted_NYC_Collisions, Predicted_Shop_Volume)
)   

kable(monthly_predictions_agg_df) %>% 
  kable_classic(full_width = FALSE, html_font = "Cambria")
YEAR MONTH Actual_NYC_Collisions Actual_Shop_Volume Predicted_NYC_Collisions Predicted_Shop_Volume
2021 01 9671 14 11472 16
2021 02 8432 12 11079 16
2021 03 10648 15 10953 15
2021 04 11501 16 10011 14
2021 05 13394 19 10761 15
2021 06 13745 19 10605 15
2021 07 12645 18 10742 15
2021 08 12473 18 10874 15
2021 09 12623 18 10874 15
2021 10 13097 18 11070 16
2021 11 11522 16 10745 15
2021 12 NA NA 10589 15

Plot of actual versus predicted demand volume

#Plot of Actual_Shop_Volume and Predicted_Shop_Volume
plot_ly(x=monthly_predictions_agg_df$MONTH, y=monthly_predictions_agg_df$Actual_Shop_Volume, 
        type='bar', name='Actual_Shop_Volume') %>% 
  add_trace(x=monthly_predictions_agg_df$MONTH, y=monthly_predictions_agg_df$Predicted_Shop_Volume, type='bar', name='Predicted_Shop_Volume') %>% 
  layout(title = 'Plot of Actual_Shop_Volume and Predicted_Shop_Volume (Test Set)',
         xaxis = list(title = 'Months in 2021'),
         yaxis = list(title = 'Car volume'))

Primary research question response

Recall that our primary research question is to “estimate the monthly demand for a fictitious auto shop in New York state by extrapolating the monthly car crash volume with reasonable accuracy”. We can see that our ensemble model does a very good job of predicting the shop volume compared with the actual volume the shop saw in 2021. We hypothesize that our model would have performed even better without the unforeseen consequences of COVID-19 and the subsequent increase in remote work availability.

Our next task is to start answering our secondary research objectives, including identifying key demographics for use in a marketing campaign and then measuring its cost and effect.

Proportion of crashes by gender

kable(train_df %>% 
        group_by(PERSON_SEX) %>% 
        summarise(n = sum(n)) %>% 
    arrange(PERSON_SEX)) %>% 
  kable_classic(full_width = FALSE, html_font = "Cambria")
PERSON_SEX n
F 275792
M 779863

Proportion of crashes by age group

kable(train_df %>% 
        mutate(age_group = 5*floor(PERSON_AGE/5)) %>%
        group_by(age_group) %>% 
        summarise(n = sum(n)) %>% 
        arrange(age_group)) %>% 
  kable_classic(full_width = FALSE, html_font = "Cambria")
age_group n
15 22473
20 97546
25 139152
30 134433
35 119523
40 106615
45 102786
50 98369
55 87184
60 64833
65 39100
70 22739
75 11642
80 5883
85 2416
90 763
95 180
100 18

Marketing Campaign

With the assumption that the auto body repair shop has a 0.14% share of the NYC business, we set a goal to increase it to about 0.25%. To do so, we considered a few options and settled on a marketing campaign that focuses on digital search advertisement.

We started by pulling together the monthly numbers from 2021 and adding a new column for the revenue we expect to generate. Based on information from the 2017 Shop Performance Survey (Bean), the average repair in the United States costs $200-$399; however, we chose $2,400 as the average repair order because collision repairs run significantly higher. The same source also listed the average gross and net profit margins (49% and 14%, respectively) for similar shops in the United States. Below is a table showing the calculations.

#Average Repair Shop Financial Metrics
average_repair_order <- 2400  #Survey average is $200-$399; collision repairs run higher
gross_profit_margin <- .49
net_profit_margin <- .14

marketing_campaign_df <- data.frame(monthly_predictions_agg_df)

marketing_campaign_df <- marketing_campaign_df %>%
                          mutate(monthly_revenue = marketing_campaign_df$Predicted_Shop_Volume* average_repair_order)

marketing_campaign_df %>%
  kbl() %>%
  kable_paper("hover", full_width = F, html_font = "Cambria")
YEAR MONTH Actual_NYC_Collisions Actual_Shop_Volume Predicted_NYC_Collisions Predicted_Shop_Volume monthly_revenue
2021 01 9671 14 11472 16 38400
2021 02 8432 12 11079 16 38400
2021 03 10648 15 10953 15 36000
2021 04 11501 16 10011 14 33600
2021 05 13394 19 10761 15 36000
2021 06 13745 19 10605 15 36000
2021 07 12645 18 10742 15 36000
2021 08 12473 18 10874 15 36000
2021 09 12623 18 10874 15 36000
2021 10 13097 18 11070 16 38400
2021 11 11522 16 10745 15 36000
2021 12 NA NA 10589 15 36000

We also obtained digital advertising data from the 2021 Paid Search Advertising Benchmarks for Every Industry article by Kristen McCormick (McCormick). We noticed that both the click-through rate and the conversion rate, 5.39% and 15.23% respectively, were very good and presented a real opportunity to achieve our goal of increasing the market share. For comparison, we ran the numbers for three levels of monthly spend ($200, $300, and $400). We summarize the calculations in the table below.

# 2021 Paid Search Industry Benchmarks (from Kristen McCormick article)
CTR <- 0.0539  #Average click-through rate 5.39%
CPC <- 3.19  #Average cost per click
conversion_rate <- 0.1523  #Average conversion rate is 15.23%
cost_per_lead <- 17.81

#Create a function that will create a column for each level of spend
create_col <- function(spend) {
  clicks <- spend/CPC
  conversion <- clicks*conversion_rate
  revenue <- conversion*average_repair_order
  ROAS <- revenue/spend
  market_share_lif <- conversion / marketing_campaign_df$Predicted_NYC_Collisions[12] * 100
  grs_prf <- revenue * gross_profit_margin
  net_prft <- revenue * net_profit_margin
  ret_on_inv <- net_prft/spend
  
  round(c(clicks, conversion, revenue, ROAS, market_share_lif, grs_prf, net_prft, ret_on_inv),2)
}

#create and display a dataframe for the three levels of spend
kable(data.frame(Campaign_KPIs = c('Clicks' , 'Conversion' , 'Revenue' , 'ROAS' , 
                                   'Market Share Lift' , 'Gross Profit' , 'Net Profit' , 'ROI'),
                 'USD200' = create_col(200),
                 'USD300' = create_col(300),
                 'USD400' = create_col(400))) %>% 
  kable_paper("hover", full_width = F, html_font = "Cambria") %>%
  column_spec(3, color =c("white"), background = "green")
Campaign_KPIs USD200 USD300 USD400
Clicks 62.70 94.04 125.39
Conversion 9.55 14.32 19.10
Revenue 22916.61 34374.92 45833.23
ROAS 114.58 114.58 114.58
Market Share Lift 0.09 0.14 0.18
Gross Profit 11229.14 16843.71 22458.28
Net Profit 3208.33 4812.49 6416.65
ROI 16.04 16.04 16.04

From the summary, it appears that spending $300 per month on digital advertising will allow us to get very close to our goal of 0.25% of the market share.
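One detail worth noting in the table: ROAS (114.58) and ROI (16.04) are identical across all three spend levels. That is a consequence of the linear model inside `create_col()`: revenue scales proportionally with spend, so the per-dollar ratios reduce to constants that depend only on the benchmark rates, not on the budget. A quick sketch using the figures already defined:

```r
# Benchmark figures from the McCormick article, as defined above
CPC <- 3.19                  # average cost per click
conversion_rate <- 0.1523    # share of clicks that become repair orders
average_repair_order <- 2400
net_profit_margin <- 0.14

# revenue = (spend / CPC) * conversion_rate * average_repair_order,
# so revenue / spend cancels the spend term entirely:
ROAS <- conversion_rate * average_repair_order / CPC
round(ROAS, 2)                       # 114.58 at any budget
round(ROAS * net_profit_margin, 2)   # 16.04, the constant ROI row
```

Only clicks, conversions, revenue, and market-share lift grow with spend; the per-dollar efficiency does not, which is why the spend level is chosen by the market-share target rather than by ROI.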

Effects of COVID-19 on NYC Vehicle Collisions

The first reported case of COVID-19 was confirmed in New York on March 1, 2020, while later reports suggest that the first case could have been as early as January. Researchers later estimated that at the time the first case was confirmed, as many as 10,700 New Yorkers had already contracted the virus (Lewis). By March 9, there were 16 confirmed cases in NYC alone and on March 16, Governor Andrew Cuomo announced public schools would be closed, effectively beginning the COVID-19 lockdown. This remained in effect until June 8, when the governor announced the first phase of reopening under safety protocols. By the end of 2020, researchers estimate that 44% of all metro New York residents had been infected. To date, over 35,000 deaths of New York citizens have been attributed to COVID-19, with at least another 5,000 listed as probable deaths (Health).

#filter data, create combined_df2
combined_df2 <- (crashes_df %>% 
                  select(-c(CRASH_DATE, CRASH.TIME, CONTRIBUTING.FACTOR.VEHICLE.1, CONTRIBUTING.FACTOR.VEHICLE.2,
                            CONTRIBUTING.FACTOR.VEHICLE.3, CONTRIBUTING.FACTOR.VEHICLE.4, CONTRIBUTING.FACTOR.VEHICLE.5,
                            VEHICLE.TYPE.CODE.4, VEHICLE.TYPE.CODE.5,
                            NUMBER.OF.PEDESTRIANS.INJURED, NUMBER.OF.PEDESTRIANS.KILLED,
                            NUMBER.OF.CYCLIST.INJURED, NUMBER.OF.CYCLIST.KILLED,
                            ON.STREET.NAME, CROSS.STREET.NAME, OFF.STREET.NAME)) %>% 
                  inner_join(vehicles_df %>% 
                               filter(!is.na(VEHICLE_ID)) %>% 
                               select(-c(CRASH_DATE, CRASH_TIME, VEHICLE_ID))
                             , by = 'COLLISION_ID') %>% 
                  inner_join(person_df %>% 
                               select(-c(UNIQUE_ID, CONTRIBUTING_FACTOR_1, CONTRIBUTING_FACTOR_2))
                             , by= c('COLLISION_ID'='COLLISION_ID', 'UNIQUE_ID'='VEHICLE_ID')) %>% 
                  filter((PED_ROLE %in% c('Driver'))) %>%  #drivers only
                  filter((CRASH_DATE >= '2019-01-01') & (CRASH_DATE < '2020-12-31')) %>% 
                   #only from 2019-01-01 to 2020-12-30
                  filter(PERSON_AGE > 14 & PERSON_AGE < 101) %>%
                  filter(PERSON_SEX == 'F' | PERSON_SEX == 'M') %>%
                  mutate(DAY = substr(CRASH_DATE,9,10)) %>%
                  mutate(TIMESTEP2 = as.period(interval(as.Date('2019-01-01'), CRASH_DATE)) %/% days(1)) %>% 
                   ####add TIMESTEP2
                  mutate(yr_mo_day = paste0(substr(CRASH_DATE,1,4),'-',substr(CRASH_DATE,6,7),'-',substr(CRASH_DATE,9,10))) %>% 
                   ######add yr_mo_day
                  select(COLLISION_ID, CRASH_DATE, DAY, TIMESTEP2, yr_mo_day, NUMBER.OF.PERSONS.KILLED, 
                         PERSON_AGE, PERSON_SEX) %>% 
                   ####ADD KILLED, DAY, TIMESTEP2, yr_mo_day
                  arrange(CRASH_DATE)
                )
kable(t(summary(combined_df2))) %>% kable_classic(full_width = TRUE, html_font = "Cambria", font_size = 14)
COLLISION_ID Min. :3822228 1st Qu.:4133700 Median :4211392 Mean :4213024 3rd Qu.:4289851 Max. :4479833
CRASH_DATE Min. :2019-01-01 1st Qu.:2019-05-16 Median :2019-09-21 Mean :2019-10-19 3rd Qu.:2020-02-15 Max. :2020-12-30
DAY Length:482737 Class :character Mode :character NA NA NA
TIMESTEP2 Min. : 0.0 1st Qu.:136.0 Median :263.0 Mean :291.5 3rd Qu.:409.0 Max. :729.0
yr_mo_day Length:482737 Class :character Mode :character NA NA NA
NUMBER.OF.PERSONS.KILLED Min. :0.000000 1st Qu.:0.000000 Median :0.000000 Mean :0.001483 3rd Qu.:0.000000 Max. :4.000000
PERSON_AGE Min. : 15.00 1st Qu.: 29.00 Median : 40.00 Mean : 41.65 3rd Qu.: 53.00 Max. :100.00
PERSON_SEX Length:482737 Class :character Mode :character NA NA NA

Explore Crash Data by Day, 2019-2020

daily_agg1 <- (combined_df2 %>% 
                   group_by(TIMESTEP2, yr_mo_day, DAY) %>% 
                   summarise(n = n(), .groups = 'drop') %>% 
                   arrange(TIMESTEP2, yr_mo_day, DAY)
                 )
plot_ly(x=daily_agg1$yr_mo_day, y=daily_agg1$n, type='bar') %>%
  layout(title = "Crash Data by Day from 2019-2020", xaxis = list(title = 'Day'), yaxis = list(title = 'Number of Crashes'))
#create covid_df for COVID analysis
covid_df <- (combined_df2 %>% 
                   filter((CRASH_DATE >= '2019-01-01') & (CRASH_DATE < '2020-12-31')) %>% #only from 2019-01-01 to 2020-12-30
                   mutate(covid = ifelse(CRASH_DATE >= '2020-03-16', 1, 0)) %>%
                   rename(fatalities = NUMBER.OF.PERSONS.KILLED) %>%
                   group_by(TIMESTEP2, yr_mo_day, DAY, covid, fatalities) %>% 
                   summarise(n = n(), .groups = 'drop') %>% 
                   arrange(TIMESTEP2, yr_mo_day, DAY, covid, fatalities)
)
head(covid_df)

Linear Regression with Indicator Variable

\[ \text{Crashes} = \beta_0 + \beta_1 \cdot \text{Covid} \]

We used COVID as an indicator variable and the number of daily crashes in NYC as the response variable. The base case when the variable covid=0 represents January 1, 2019 - March 15, 2020. When covid=1, this represents March 16, 2020 - Dec 31, 2020.

#linear regression with only covid indicator variable
covid_model1 <- lm(n ~ covid, data = covid_df)
summary(covid_model1)
## 
## Call:
## lm(formula = n ~ covid, data = covid_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -600.8 -230.8  102.7  245.5  708.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   601.80      13.77   43.71   <2e-16 ***
## covid        -369.04      21.73  -16.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 347.5 on 1062 degrees of freedom
## Multiple R-squared:  0.2135, Adjusted R-squared:  0.2128 
## F-statistic: 288.3 on 1 and 1062 DF,  p-value: < 2.2e-16

Both the intercept and the covid coefficient are statistically significant. We can conclude from this model that before COVID there was an average of 602 crashes a day in NYC, and that this dropped to 233 crashes a day (602 - 369) after COVID began. This is a large and statistically significant 61.3% decrease in overall daily crashes.

According to the National Highway Traffic Safety Administration of the US Department of Transportation, there are 13.6 million vehicle collisions each year in the United States, and the average cost of vehicle repair per collision is $21,036 (Barnes et al.). Our research also indicates that 22 percent of collision claims involve vehicles that are totaled (Vallet). We assume that accidents in which the vehicle is not totaled represent potential business for our auto body repair center. Removing totaled collisions, we conclude that the average number of repairable accidents per day from 2019 through the beginning of COVID-19 was 470, while the average after COVID-19 through the end of 2020 was 182.

We estimate $6,054,581 per day in lost revenue for auto body shops in NYC post-COVID. Assuming a 0.14% market share, our fictitious auto body shop loses $8,476 a day. Furthermore, we estimate the total revenue this shop lost during the COVID-19 lockdown in NYC, March through June 2020, at $712,019.
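The dollar figures above can be approximately reproduced from the `covid_model1` coefficients and the stated assumptions (22% of claims totaled, $21,036 average repair cost, ~0.14% market share, and roughly 84 lockdown days from March 16 to June 8, 2020). Small differences from the reported values come from rounding at intermediate steps; a sketch:

```r
# Daily crash rates implied by covid_model1's coefficients
pre_covid_daily  <- 601.80            # intercept
post_covid_daily <- 601.80 - 369.04   # intercept + covid coefficient = 232.76

repairable_share <- 1 - 0.22          # 22% of claims are totaled (Vallet)
avg_repair_cost  <- 21036             # average repair cost per collision (Barnes et al.)
shop_share       <- 0.0014            # our shop's ~0.14% market share

# Repairable accidents per day
pre_covid_daily * repairable_share    # 469.4 (reported as 470)
post_covid_daily * repairable_share   # 181.6 (reported as 182)

# Daily revenue lost citywide, and for our shop
lost_daily_nyc  <- (pre_covid_daily - post_covid_daily) * repairable_share * avg_repair_cost
lost_daily_shop <- lost_daily_nyc * shop_share
round(lost_daily_nyc)                 # ~6,055,000 (reported: 6,054,581)
round(lost_daily_shop)                # ~8,477   (reported: 8,476)

# Lockdown of ~84 days (2020-03-16 through 2020-06-08)
round(lost_daily_shop * 84)           # ~712,000 (reported: 712,019)
```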

Future work

To build on this work, one can perform further data analysis to

  • Understand differences between traffic collisions that involve cyclists and those that do not. Research by Li and Zhao points to significant differences between cyclists’ and drivers’ experiences of vehicle collisions, with cyclists experiencing three times as many collisions with fatal outcomes as drivers.
  • Determine features of vehicular and cycling collision hot spots.
  • Use the sophisticated traffic-exposure formulae from Regev et al. (133) on new data sets to see whether they apply in other places.
  • Perform more data exploration through advanced visualizations to pull out and analyze any other trends in the data.

Works Cited

Barnes, Stephen R. et al. “COVID-19 Lockdown and Traffic Accidents: Lessons from the Pandemic.” *Contemporary Economic Policy* 40.2 (2022) : 349–368. <https://onlinelibrary.wiley.com/doi/abs/10.1111/coep.12562>.
Bean, Travis. “The 2017 Shop Performance Survey.” 1 Nov. 2017. 16 Jul. 2022. <https://www.ratchetandwrench.com/articles/5258-the-2017-shop-performance-survey>.
He, Songtao et al. “Inferring High-Resolution Traffic Accident Risk Maps Based on Satellite Imagery and GPS Trajectories.” (2021) : 11957–11965.
Health, NYC. “COVID-19:data–NYC Health.” 1 Apr. 2020. 17 Jul. 2022. <https://www1.nyc.gov/site/doh/covid/covid-19-data.page>.
Lam, Lawrence T. “Distractions and the Risk of Car Crash Injury: The Effect of Drivers’ Age.” *Journal of Safety Research* 33.3 (2002) : 411–419. <https://www.sciencedirect.com/science/article/pii/S0022437502000348>.
Lewis, Caroline. “10,700 New Yorkers Had Already Contracted the Virus.” 6 May 2020. 14 Jul. 2022. <https://gothamist.com/news/officially-nyc-had-one-covid-19-case-new-research-suggests-it-was-actually-10000>.
Li, Jintai, and Zhan Zhao. *Accident Analysis and Prevention* 167 (2022) : n. pag. <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8806026/>.
McCormick, Kristen. “2021 Paid Search Advertising Benchmarks for Every Industry.” 20 May 2022. 16 Jul. 2022. <https://www.wordstream.com/blog/ws/2021/10/13/search-advertising-benchmarks>.
“Motor Vehicle Collisions - Crashes.” 4 Jul. 2022. 4 Jul. 2022. <https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95>.
Motor Vehicles, New York State Department of. “Find a DMV Regulated Business.” 19 Jun. 2022. <https://process.dmv.ny.gov/FacilityLookup/?_ga=2.33023101.270841510.1656975631-932475252.1656975630&TSPD_101_R0=084c043756ab200049c13ac02223efa9441583bad77008d3b71790bcb79ba757be5b2aa10507b29f084bbc5efb143000172ce93dfec8420e0dc96ad9ed441e7f4a2448770215c0d5a6afcf5d46b82c8f7237478452ff354f386ddeb3a9b77d8a>.
Regev, Shirley, Jonathan J. Rolison, and Salissou Moutari. “Crash Risk by Driver Age, Gender, and Time of Day Using a New Exposure Methodology.” *Journal of Safety Research* 66 (2018) : 131–140. <https://www.sciencedirect.com/science/article/pii/S0022437517307600>.
Thompson, Karl. “What Is Normal?” 3 Sep. 2018. 4 Jul. 2022. <https://revisesociology.com/2018/09/03/what-is-normal/>.
Vallet, Mark. “Total-Loss Thresholds by State.” 16 Sep. 2021. 17 Jul. 2020. <https://www.carinsurance.com/Articles/total-loss-thresholds.aspx>.
Volovich, Kristina. “What’s a Good Clickthrough Rate? New Benchmark Data for Google AdWords.” 4 Jul. 2022. <https://blog.hubspot.com/agency/google-adwords-benchmark-data>.
Williams, Allan F. “Teenage Drivers: Patterns of Risk.” *Journal of Safety Research* 34.1 (2003) : 5–15. <https://www.sciencedirect.com/science/article/pii/S0022437502000750>.

  1. We assume, of course, that all of their customers have normal psychological patterns (Thompson)↩︎